3  Grouping Data and Summary Queries

3.1 Reference

Default database in this chapter is “ap”.

Other databases included in the simulation MySQL server:

  • om
  • ex

3.2 Aggregated Functions

These are tools to calculate common summarized statistics, usually used when aggregate records to a lower granularity, hence the name “Aggregated”.

A full list of aggregated functions from MySQL can be found here.

Here is an example of getting a variety of summary stats about invoices with amount outstanding:

SELECT COUNT(*) AS number_of_invoices,
    COUNT(DISTINCT vendor_id) AS number_of_vendors,
    SUM(invoice_total - payment_total - credit_total) AS total_due,
    MAX(invoice_total - payment_total - credit_total) AS highest_invoice_due,
    MIN(invoice_total - payment_total - credit_total) AS lowest_invoice_due
FROM ap.invoices
WHERE invoice_total - payment_total - credit_total > 0
1 records
number_of_invoices number_of_vendors total_due highest_invoice_due lowest_invoice_due
11 7 32020.42 19351.18 30.75

MIN(), MAX() can also be used to work with non-numeric values. They are essentially shortcut to:

  • MIN(): sort the column values in ascending order, and take the first value (such as a from a-z, or the earliest date)
  • MAX(): sort the column values in ascending order, and take the last value (such as z from a-z, or the latest date)
SELECT MIN(vendor_name) AS first_vendor,
    MAX(vendor_name) AS last_vendor,
    COUNT(vendor_name) AS number_of_vendors
FROM ap.vendors
1 records
first_vendor last_vendor number_of_vendors
Abbey Office Furnishings Zylka Design 122

3.3 Grouping Data: GROUP BY

Grouping: the server treats the target table as if it is subset by each unique combination of values.

This is useful when you need to calculate or perform operations for each of those subset independently and then combine them back to a single table.

The following is a good example:

  1. it treats all records by each value in vendor_id separately;
  2. it calculates average invoice amount with invoice records by each vendor_id;
  3. it subsets, based on average of invoice_total greater than 2000;
  4. finally, it sorts the final output by calculated average amount.

IMPORTANT:

You cannot use WHERE for aggregated results from grouped operations.

SELECT vendor_id, ROUND(AVG(invoice_total), 2) AS average_invoice_amount
FROM ap.invoices
GROUP BY vendor_id
HAVING AVG(invoice_total) > 2000
ORDER BY average_invoice_amount DESC
8 records
vendor_id average_invoice_amount
110 23978.48
72 10963.66
104 7125.34
99 6940.25
119 4901.26
122 2575.33
86 2433.00
100 2184.50

However, you can continue using WHERE to subset on non-aggregated fields/dimensions. For example:

SELECT 
    invoice_date,
    COUNT(*) AS invoice_qty,
    SUM(invoice_total) AS invoice_sum
FROM invoices
WHERE invoice_date BETWEEN '2022-06-01' AND '2022-06-30'
GROUP BY invoice_date
HAVING COUNT(*) > 1 
    AND SUM(invoice_total) > 100
ORDER BY invoice_date DESC
8 records
invoice_date invoice_qty invoice_sum
2022-06-24 3 309.80
2022-06-22 4 299.50
2022-06-18 2 6972.20
2022-06-11 4 3867.50
2022-06-10 2 354.30
2022-06-09 2 1129.79
2022-06-03 3 1832.00
2022-06-01 3 24627.03

It also works with table joined to the target home table:

SELECT vendor_state, vendor_city, COUNT(*) AS invoice_qty,
    ROUND(AVG(invoice_total), 2) AS invoice_avg
FROM ap.invoices JOIN ap.vendors
    ON invoices.vendor_id = vendors.vendor_id
GROUP BY vendor_state, vendor_city
ORDER BY vendor_state, vendor_city
Displaying records 1 - 15
vendor_state vendor_city invoice_qty invoice_avg
AZ Phoenix 1 662.00
CA Fresno 19 1208.75
CA Los Angeles 1 503.20
CA Oxnard 3 188.00
CA Pasadena 5 196.12
CA Sacramento 7 253.00
CA San Francisco 3 1211.04
CA Turlock 1 95.00
CA Valencia 1 6940.25
DC Washington 1 600.00
MA Boston 1 1367.50
MI Ann Arbor 5 23978.48
MI Auburn Hills 2 10963.66
NV Reno 9 2575.33
NV The Lake 1 2184.50

Summary Row: WITH ROLLUP

WITH ROLLUP keywords allow you to add a summary row at the end of each group.

See the following example: we obtain the last few rows of aggregated results and an aggregated line of summary:

  • for each value of vendor_state
  • for overall level
SELECT vendor_state, vendor_city, COUNT(*) AS qty_vendors
FROM ap.vendors
WHERE vendor_state IN ('IA', 'NJ')
GROUP BY vendor_state, vendor_city WITH ROLLUP
8 records
vendor_state vendor_city qty_vendors
IA Fairfield 1
IA Washington 1
IA NA 2
NJ East Brunswick 2
NJ Fairfield 1
NJ Washington 1
NJ NA 4
NA NA 6

Summary Row with Level Control: GROUPING()

The best way to see what GROUPING() does, is to see the following examples.

First, let’s what GROUPING() with different variables produces:

SELECT vendor_state,
  vendor_city,
  vendor_zip_code,
  COUNT(*) AS qty_vendors,
  GROUPING(vendor_state), 
  GROUPING(vendor_city), 
  GROUPING(vendor_state, vendor_city),
  GROUPING(vendor_state, vendor_zip_code),
  GROUPING(vendor_city, vendor_zip_code),
  GROUPING(vendor_state, vendor_city, vendor_zip_code)
FROM ap.vendors
WHERE vendor_state IN ('IA', 'NJ')
GROUP BY vendor_state, vendor_city, vendor_zip_code WITH ROLLUP
14 records
vendor_state vendor_city vendor_zip_code qty_vendors GROUPING(vendor_state) GROUPING(vendor_city) GROUPING(vendor_state, vendor_city) GROUPING(vendor_state, vendor_zip_code) GROUPING(vendor_city, vendor_zip_code) GROUPING(vendor_state, vendor_city, vendor_zip_code)
IA Fairfield 52556 1 0 0 0 0 0 0
IA Fairfield NA 1 0 0 0 1 1 1
IA Washington 52353 1 0 0 0 0 0 0
IA Washington NA 1 0 0 0 1 1 1
IA NA NA 2 0 1 1 1 3 3
NJ East Brunswick 08810 1 0 0 0 0 0 0
NJ East Brunswick 08816 1 0 0 0 0 0 0
NJ East Brunswick NA 2 0 0 0 1 1 1
NJ Fairfield 07004 1 0 0 0 0 0 0
NJ Fairfield NA 1 0 0 0 1 1 1
NJ Washington 07882 1 0 0 0 0 0 0
NJ Washington NA 1 0 0 0 1 1 1
NJ NA NA 4 0 1 1 1 3 3
NA NA NA 6 1 1 3 3 3 7

This can help us to easily identify rollup rows at different levels and then to create friendly labels at each level:

SELECT 
  IF(GROUPING(vendor_state) = 1, 'Vendor_State_Count', vendor_state) AS vendor_state,
  IF(GROUPING(vendor_city) = 1, 'Vendor_City_Count', vendor_city) AS vendor_city,
  COUNT(*) AS qty_vendors
FROM ap.vendors
WHERE vendor_state IN ('IA', 'NJ')
GROUP BY vendor_state, vendor_city WITH ROLLUP
8 records
vendor_state vendor_city qty_vendors
IA Fairfield 1
IA Washington 1
IA Vendor_City_Count 2
NJ East Brunswick 2
NJ Fairfield 1
NJ Washington 1
NJ Vendor_City_Count 4
Vendor_State_Count Vendor_City_Count 6

If you want to see rollup rows only:

SELECT 
  IF(GROUPING(vendor_state) = 1, 'Vendor_State_Count', vendor_state) AS vendor_state,
  IF(GROUPING(vendor_city) = 1, 'Vendor_City_Count', vendor_city) AS vendor_city,
  COUNT(*) AS qty_vendors
FROM ap.vendors
WHERE vendor_state IN ('IA', 'NJ')
GROUP BY vendor_state, vendor_city WITH ROLLUP
HAVING GROUPING(vendor_state) = 1 OR GROUPING(vendor_city) = 1
3 records
vendor_state vendor_city qty_vendors
IA Vendor_City_Count 2
NJ Vendor_City_Count 4
Vendor_State_Count Vendor_City_Count 6

3.4 Window: OVER(), PARTITION BY

What if you would like to create grouping effect but you don’t want to lose any granularity of data? This is where “window” concept cuts in. OVER() function creates a window, with PARTITION BY specifies the “grouping effect” for the windows to be created.

The following example creates a sum of invoice_total at different levels. Notice that:

  • total_invoices: created with nothing in OVER(), it will just create a column of sum value
  • vendor_total: created with PARTITION BY in OVER(), it will create subtotals by each value of vendor_id
  • you don’t lose any granularity of the data, you can still choose all the columns from invoices table that you wish to include.
  • you can still use WHERE to subset for target table.
SELECT vendor_id,
  invoice_date,
  invoice_total,
  SUM(invoice_total) OVER() AS total_invoices,
  SUM(invoice_total) OVER(PARTITION BY vendor_id) AS vendor_total
FROM ap.invoices
WHERE invoice_total > 5000
8 records
vendor_id invoice_date invoice_total total_invoices vendor_total
72 2022-06-01 21842.00 155800 21842.00
99 2022-06-18 6940.25 155800 6940.25
104 2022-05-21 7125.34 155800 7125.34
110 2022-05-28 37966.19 155800 119892.41
110 2022-07-19 26881.40 155800 119892.41
110 2022-07-23 20551.18 155800 119892.41
110 2022-07-24 23517.58 155800 119892.41
110 2022-07-31 10976.06 155800 119892.41

Running Windows: OVER(), PARTITION BY, ORDER BY

Comparing with the above, with ORDER BY added, we are now calculating running total for each vendor as each invoice being issued for the same vendor.

“ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW”: determines the running total behaviour. It says operations performed for all rows up to this current row. See window function for vendor_id 110 in action.

SELECT 
  vendor_id,
  invoice_date,
  invoice_total,
  SUM(invoice_total) OVER() AS total_invoices,
  SUM(invoice_total) OVER(PARTITION BY vendor_id
    ORDER BY invoice_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS vendor_total
FROM ap.invoices
WHERE invoice_total > 5000
8 records
vendor_id invoice_date invoice_total total_invoices vendor_total
72 2022-06-01 21842.00 155800 21842.00
99 2022-06-18 6940.25 155800 6940.25
104 2022-05-21 7125.34 155800 7125.34
110 2022-05-28 37966.19 155800 37966.19
110 2022-07-19 26881.40 155800 64847.59
110 2022-07-23 20551.18 155800 85398.77
110 2022-07-24 23517.58 155800 108916.35
110 2022-07-31 10976.06 155800 119892.41

You can modify the “width” of the window. See the following:

  • ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING: allows you to use 1 row up (if any) and 1 row down (if any) relative to the current row to make calculation. Obviously you can change to any numbers as you wish.
  • Now it calculates the average invoice total amounts based on rolling 3 invoice_dates.
SELECT vendor_id,
  invoice_date,
  invoice_total,
  SUM(invoice_total) OVER() AS total_invoices,
  ROUND(AVG(invoice_total) OVER(PARTITION BY vendor_id 
    ORDER BY invoice_date 
    ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING), 2) AS vendor_total
FROM ap.invoices
WHERE invoice_total > 5000
8 records
vendor_id invoice_date invoice_total total_invoices vendor_total
72 2022-06-01 21842.00 155800 21842.00
99 2022-06-18 6940.25 155800 6940.25
104 2022-05-21 7125.34 155800 7125.34
110 2022-05-28 37966.19 155800 32423.80
110 2022-07-19 26881.40 155800 28466.26
110 2022-07-23 20551.18 155800 23650.05
110 2022-07-24 23517.58 155800 18348.27
110 2022-07-31 10976.06 155800 17246.82

If you need to modify the window based on range of a value you can use RANGE instead of ROW keyword.

The following specifies the window to be based on invoice_date and treat a day before and a day after as the same range. (See vendor_id 110 and between 2022-07-19 and 2022-07-31):

SELECT vendor_id,
  invoice_date,
  invoice_total,
  SUM(invoice_total) OVER() AS total_invoices,
  SUM(invoice_total) OVER(PARTITION BY vendor_id 
    ORDER BY invoice_date 
    RANGE BETWEEN INTERVAL 1 DAY PRECEDING AND INTERVAL 1 DAY FOLLOWING) AS vendor_total
FROM ap.invoices
WHERE invoice_total > 5000
8 records
vendor_id invoice_date invoice_total total_invoices vendor_total
72 2022-06-01 21842.00 155800 21842.00
99 2022-06-18 6940.25 155800 6940.25
104 2022-05-21 7125.34 155800 7125.34
110 2022-05-28 37966.19 155800 37966.19
110 2022-07-19 26881.40 155800 26881.40
110 2022-07-23 20551.18 155800 44068.76
110 2022-07-24 23517.58 155800 44068.76
110 2022-07-31 10976.06 155800 10976.06

Create Row Index

The following creates row_number for each group by vendor_state:

SELECT
  ROW_NUMBER() OVER(PARTITION BY vendor_state ORDER BY vendor_name) AS 'row_number',
  vendor_name,
  vendor_state
FROM ap.vendors
Displaying records 1 - 15
row_number vendor_name vendor_state
1 AT&T AZ
2 Computer Library AZ
3 Wells Fargo Bank AZ
1 Abbey Office Furnishings CA
2 American Express CA
3 ASC Signs CA
4 Aztek Label CA
5 Bertelsmann Industry Svcs. Inc CA
6 BFI Industries CA
7 Bill Jones CA
8 Bill Marvin Electric Inc CA
9 Blanchard & Johnson Associates CA
10 Blue Cross CA
11 Blue Shield of California CA
12 Cal State Termite CA

The following compares the differences among rankings and row index:

SELECT 
  vendor_id,
  invoice_total,
  invoice_number,
  ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY invoice_total) AS 'row_id', 
  RANK() OVER (PARTITION BY vendor_id ORDER BY invoice_total) AS 'rank', 
  DENSE_RANK() OVER (PARTITION BY vendor_id ORDER BY invoice_total) AS 'dense_rank'
FROM ap.invoices
Displaying records 1 - 15
vendor_id invoice_total invoice_number row_id rank dense_rank
34 116.54 QP58872 1 1 1
34 1083.58 Q545443 2 2 2
37 116.00 547479217 1 1 1
37 224.00 547481328 2 2 2
37 224.00 547480102 3 2 2
48 856.92 P02-88D77S7 1 1 1
72 85.31 39104 1 1 1
72 21842.00 40318 2 2 2
80 90.36 134116 1 1 1
80 175.00 133560 2 2 2
81 936.93 MABO1489 1 1 1
82 600.00 C73-24 1 1 1
83 579.42 31361833 1 1 1
83 1575.00 31359783 2 2 2
86 2433.00 367447 1 1 1

Create Lags and Leads

LAG(): Allows to “push each record down” and lines up the current row value with the value from last row. LEAD(): Allows to “pull each record up” and lines up the current row value with the value from next row.

These are useful to compare results for year over year, month over month etc.

Also notice here that “WINDOW” clause allows to create alias of a particular type of window, so you don’t have to repeat the specifications for OVER() each time.

SELECT 
  rep_id, 
  sales_year,
  -- lag series
  sales_total AS current_sales,
  LAG(sales_total, 1, 0) OVER rep_window AS last_sales,
  Sales_total - LAG(sales_total, 1, 0) OVER rep_window AS change_from_last,
  -- lead series
  sales_total AS current_sales_again,
  LEAD(sales_total, 1, 0) OVER rep_window AS next_sales,
  LEAD(sales_total, 1, 0) OVER rep_window - Sales_total AS change_from_current
FROM ex.sales_totals
WINDOW rep_window AS (PARTITION BY rep_id ORDER BY sales_year)
12 records
rep_id sales_year current_sales last_sales change_from_last current_sales_again next_sales change_from_current
1 2020 1274856.38 0.0 1274856.38 1274856.38 923746.85 -351109.53
1 2021 923746.85 1274856.4 -351109.53 923746.85 998337.46 74590.61
1 2022 998337.46 923746.8 74590.61 998337.46 0.00 -998337.46
2 2020 978465.99 0.0 978465.99 978465.99 974853.81 -3612.18
2 2021 974853.81 978466.0 -3612.18 974853.81 887695.75 -87158.06
2 2022 887695.75 974853.8 -87158.06 887695.75 0.00 -887695.75
3 2020 1032875.48 0.0 1032875.48 1032875.48 1132744.56 99869.08
3 2021 1132744.56 1032875.5 99869.08 1132744.56 0.00 -1132744.56
4 2021 655786.92 0.0 655786.92 655786.92 72443.37 -583343.55
4 2022 72443.37 655786.9 -583343.55 72443.37 0.00 -72443.37
5 2021 422847.86 0.0 422847.86 422847.86 45182.44 -377665.42
5 2022 45182.44 422847.9 -377665.42 45182.44 0.00 -45182.44